Background:
ahead = c(5, 12, 19, 26).This notebook:
Next steps:
library(covidcast)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
states_dc_pr_vi = c('al', 'ak', 'az', 'ar', 'ca', 'co', 'ct', 'dc', 'de', 'fl',
'ga', 'hi', 'id', 'il', 'in', 'ia', 'ks', 'ky', 'la', 'me',
'md', 'ma', 'mi', 'mn', 'ms', 'mo', 'mt', 'ne', 'nv', 'nh',
'nj', 'nm', 'ny', 'nc', 'nd', 'oh', 'ok', 'or', 'pa', 'ri',
'sc', 'sd', 'tn', 'tx', 'ut', 'vt', 'va', 'wa', 'wv', 'wi',
'wy', 'pr', 'vi')
start_day = '2020-05-01'
end_day = '2021-12-01'
geo_type = 'state'
hhs_source = 'hhs'
flu_hosp_num = 'confirmed_admissions_influenza_1d_7dav'
flu_hosp_prop = 'confirmed_admissions_influenza_1d_prop_7dav'
covid_hosp_num = 'confirmed_admissions_covid_1d_7dav'
covid_hosp_prop = 'confirmed_admissions_covid_1d_prop_7dav'
jhu_source = 'jhu-csse'
covid_case_num = 'confirmed_7dav_incidence_num'
covid_case_prop = 'confirmed_7dav_incidence_prop'
flu_hosp_prop_data = covidcast_signal(data_source = hhs_source,
signal = flu_hosp_prop,
start_day = start_day,
end_day = end_day,
geo_type = geo_type
) %>% tibble (
) %>% filter (
geo_value %in% states_dc_pr_vi,
) %>% select (
geo_value,
time_value,
value,
) %>% mutate (
signal='flu_hosp_prop',
)
covid_hosp_prop_data = covidcast_signal(data_source = hhs_source,
signal = covid_hosp_prop,
start_day = start_day,
end_day = end_day,
geo_type = geo_type
) %>% tibble (
) %>% filter (
geo_value %in% states_dc_pr_vi,
) %>% select (
geo_value,
time_value,
value,
) %>% mutate (
signal='covid_hosp_prop',
)
covid_case_prop_data = covidcast_signal(data_source = jhu_source,
signal = covid_case_prop,
start_day = start_day,
end_day = end_day,
geo_type = geo_type
) %>% tibble (
) %>% filter (
geo_value %in% states_dc_pr_vi,
) %>% select (
geo_value,
time_value,
value,
) %>% mutate (
signal='covid_case_prop',
)
data_tall = bind_rows(
flu_hosp_prop_data,
covid_hosp_prop_data,
covid_case_prop_data,
)
data_wide = flu_hosp_prop_data %>% select (
geo_value,
time_value,
value_flu_hosp_prop = value,
) %>% inner_join (
covid_hosp_prop_data %>% select (
geo_value,
time_value,
value_covid_hosp_prop = value,
),
by=c('geo_value', 'time_value')
) %>% inner_join(
covid_case_prop_data %>% select (
geo_value,
time_value,
value_covid_case_prop = value,
),
by=c('geo_value', 'time_value')
)
plt = ggplot(data_tall,
aes(x=time_value,
y=value,
color=geo_value),
) + geom_line (
) + facet_wrap (
vars(signal),
nrow=3,
scales = 'free_y',
)
plt
In order to improve the clarity of the plots above, we remove the four states with extreme hospitalization rates (don’t know how to adjust the y limits in a facet_wrap, and I don’t think this substantially changes the conclusions of the EDA.
data_tall %>% filter (
signal == 'flu_hosp_prop',
abs(value) >= 0.2,
) %>% group_by (
geo_value,
) %>% summarize (
count = n(),
) # Nevada, Wyoming, and Massachusetts - reporting errors?
## # A tibble: 4 × 2
## geo_value count
## <chr> <int>
## 1 ma 8
## 2 nv 57
## 3 vi 14
## 4 wy 10
plt = ggplot(data_tall %>% filter (!(geo_value %in% c('nv', 'ma', 'wy', 'vi'))),
aes(x=time_value,
y=value,
color=geo_value),
) + geom_line (
) + facet_wrap (
vars(signal),
nrow=3,
scales = 'free_y',
)
plt
COVID case rates and COVID hospitalization rates both obey rough trends dominated by the “surge” structure. To some extent, hospitalization rates are a lagged version of cases. This effect appears most exaggerated during the holiday season 2020.
The flu hospitalization rate shows little similarity with either of the COVID signals, save for what appears to be an intensification of hospitalizations during the holiday season 2020. The signal appears to largely be noise. (Since we only have flu hospitalization data starting from late-2020, none of the data follows the classical, pre-COVID flu season trajectory).
It is honestly hard to see anything other than a baseline forecaster (i.e., propagate the previous week forward) doing well in this setting.
plt = data_tall %>% filter (
signal == 'flu_hosp_prop',
time_value >= '2021-10-01',
) %>% ggplot (
aes(x=time_value,
y=value,
color=geo_value)
) + geom_line (
)
plt
We trained two forecasters, using forecast dates from October 1, 2021 to October 31, 2021. For each forecast date, we predict 5 days ahead to 28 days ahead. Predictions are made for the seven county-level quantiles prescribed by the COVIDhub. (The CDC requests predictions for 23 quantiles). Training was done on the rates scale; one can transform the predictions to the counts scale (as requested by the CDC) post hoc.
Here, we plot mean WIS’s, aggregated across forecast dates and locations, by ahead.
results = readRDS('results/results.RDS')
plt = results %>% group_by (
forecaster,
ahead,
) %>% summarize (
mean_wis = mean(wis),
) %>% ggplot (
aes(x=ahead, y=mean_wis, color=forecaster)
) + geom_point (
) + geom_line (
) + geom_vline (
xintercept=c(5, 12, 19, 26),
linetype='dotted',
)
plt
Here, we plot population-weighted mean WIS’s, aggregated across forecast dates and locations, by ahead.
state_pop = covidcast::state_census %>% tibble (
) %>% transmute (
geo_value = stringr::str_to_lower(ABBR),
pop = POPESTIMATE2019,
)
results = results %>% left_join (
state_pop,
by='geo_value',
) %>% mutate (
pop_wis = wis * pop,
pop_ae = ae * pop,
)
plt = results %>% group_by (
forecaster,
ahead,
) %>% summarize (
mean_pop_wis = mean(pop_wis) / 1e6,
) %>% ggplot (
aes(x=ahead, y=mean_pop_wis, color=forecaster)
) + geom_point (
) + geom_line (
) + geom_vline (
xintercept=c(5, 12, 19, 26),
linetype='dotted',
)
plt